NSF PAR Search | NSF Public Access Repository

Vulnerability Detection with Code Language Models: How Far Are We?

Ding, Yangruibo; Fu, Yanjun; Ibrahim, Omniyyah; Sitawarin, Chawin; Chen, Xinyun; Alomair, Basel; Wagner, David; Ray, Baishakhi; Chen, Yizheng (April 2025, 47th International Conference on Software Engineering)

In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PRIMEVUL, a new dataset for training and evaluating code LMs for vulnerability detection. PRIMEVUL incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs’ performance in real-world conditions. Evaluating code LMs on PRIMEVUL reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PRIMEVUL. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
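The abstract above names two leakage mitigations, de-duplication and chronological splitting, without describing how they are implemented. The following is a minimal sketch of the general idea only, not PRIMEVUL's actual pipeline: the `Sample` record, the whitespace-stripping `normalize` step, and the 80/20 `train_frac` default are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sample:
    """Hypothetical record: a function body, its label, and its commit date."""
    code: str
    label: int
    commit_date: date


def normalize(code: str) -> str:
    # Assumed normalization: drop all whitespace so that formatting-only
    # variants of the same function hash to the same value.
    return "".join(code.split())


def deduplicate(samples):
    """Keep the first occurrence of each normalized function body."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample.code).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def chronological_split(samples, train_frac=0.8):
    """Train on the oldest samples and test on the newest ones."""
    ordered = sorted(samples, key=lambda s: s.commit_date)
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]
```

Splitting by date rather than at random means that no test sample predates a training sample, which is one way to approximate the realistic setting of predicting on code written after the model was trained.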
Full Text Available
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Ding, Yangruibo; Peng, Jinjun; Min, Marcus; Kaiser, Gail; Yang, Junfeng; Ray, Baishakhi (December 2024, Advances in Neural Information Processing Systems, NeurIPS 2024)

Full Text Available
PropTest: Automatic Property Testing for Improved Visual Programming

https://doi.org/10.18653/v1/2024.findings-emnlp.483

Koo, Jaywon; Yang, Ziyan; Cascante-Bonilla, Paola; Ray, Baishakhi; Ordonez, Vicente (November 2024, Findings of the Association for Computational Linguistics)

Full Text Available
kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution

Mathai, Alex; Huang, Chenxi; Maniatis, Petros; Nogikh, Aleksandr; Ivancic, Franjo; Yang, Junfeng; Ray, Baishakhi (December 2024, Conference on Neural Information Processing Systems (NeurIPS))

Full Text Available
SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Ding, Yangruibo; Peng, Jinjun; Min, Marcus J; Kaiser, Gail; Yang, Junfeng; Ray, Baishakhi (September 2024, OpenReview.net)

Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data and the need for semantic understanding for complex tasks like debugging and program repair. We introduce a novel strategy, monologue reasoning, to train Code LLMs to reason about comprehensive semantics, encompassing high-level functional descriptions, local execution effects of individual statements, and overall input/output behavior, thereby linking static code text with dynamic execution states. We begin by collecting PyX, a clean Python corpus of fully executable code samples with functional descriptions and test cases. We propose training Code LLMs not only to write code but also to understand code semantics by reasoning about key properties, constraints, and execution behaviors using natural language, mimicking human verbal debugging, i.e., rubber-duck debugging. This approach led to the development of SemCoder, a Code LLM with only 6.7B parameters, which shows competitive performance with GPT-3.5-turbo on code generation and execution reasoning tasks. SemCoder achieves 79.3% on HumanEval (GPT-3.5-turbo: 76.8%), 63.6% on CRUXEval-I (GPT-3.5-turbo: 50.3%), and 63.9% on CRUXEval-O (GPT-3.5-turbo: 59.0%). We also study the effectiveness of SemCoder's monologue-style execution reasoning compared to concrete scratchpad reasoning, showing that our approach integrates semantics from multiple dimensions more smoothly. Finally, we demonstrate the potential of applying learned semantics to improve Code LLMs' debugging and self-refining capabilities. Our data, code, and models are available at: https://github.com/ARiSE-Lab/SemCoder.
Full Text Available
CYCLE: Learning to Self-Refine the Code Generation

https://doi.org/10.1145/3649825

Ding, Yangruibo; Min, Marcus J; Kaiser, Gail; Ray, Baishakhi (April 2024, Proceedings of the ACM on Programming Languages)

Pre-trained code language models have achieved promising performance in code generation and improved the programming efficiency of human developers. However, their self-refinement capability is typically overlooked by existing evaluations of code LMs, which focus only on the accuracy of the one-time prediction. When code LMs fail to implement the correct program, developers find it hard to debug and fix the faulty prediction because they did not write it themselves. Unfortunately, our study reveals that code LMs cannot efficiently self-refine their faulty generations either. In this paper, we propose the CYCLE framework, which learns to self-refine a faulty generation according to the available feedback, such as the execution results reported by test suites. We evaluate CYCLE on three popular code generation benchmarks: HumanEval, MBPP, and APPS. The results reveal that CYCLE maintains, and sometimes improves, the quality of one-time code generation, while significantly improving the self-refinement capability of code LMs. We implement four variants of CYCLE with 350M, 1B, 2B, and 3B parameters, and the experiments show that CYCLE consistently boosts the code generation performance, by up to 63.5
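The feedback loop this abstract describes (generate, run the tests, feed the failure back) can be sketched generically as follows. This is an illustrative outline of self-refinement at inference time, not CYCLE's training procedure or code: `generate` is a hypothetical stand-in for any model call, and `run_tests` and `max_rounds` are assumptions.

```python
from typing import Callable, Optional

# Hypothetical model interface: (prompt, previous attempt, feedback) -> source code.
Generator = Callable[[str, Optional[str], Optional[str]], str]


def run_tests(source: str, tests: str) -> Optional[str]:
    """Execute a candidate and its tests; return None on success, else an error summary.

    Note: exec() runs arbitrary code. A real harness would sandbox this step.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        exec(tests, namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def self_refine(generate: Generator, prompt: str, tests: str,
                max_rounds: int = 3) -> Optional[str]:
    """Regenerate with execution feedback until the tests pass or rounds run out."""
    previous: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(prompt, previous, feedback)
        feedback = run_tests(candidate, tests)
        if feedback is None:
            return candidate
        previous = candidate
    return None
```

The first call passes no previous attempt and no feedback, so it corresponds to the one-time prediction the abstract mentions; every later call sees the failed attempt and its error.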
Full Text Available
Towards Causal Deep Learning for Vulnerability Detection

https://doi.org/10.1145/3597503.3639170

Rahman, Md Mahbubur; Ceka, Ira; Mao, Chengzhi; Chakraborty, Saikat; Ray, Baishakhi; Le, Wei (April 2024, ACM)

Full Text Available
Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Min, Marcus J; Ding, Yangruibo; Buratti, Luca; Pujar, Saurabh; Kaiser, Gail; Jana, Suman; Ray, Baishakhi (April 2024, OpenReview)

Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain.
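One way to picture the self-consistency property described above is as a round trip: turn a specification into code, describe that code, regenerate code from the description, and check that behavior is unchanged. The sketch below illustrates that idea under strong simplifying assumptions and is not the IdentityChain implementation: programs are modeled as plain Python callables on integers, behavior is compared on a fixed list of inputs, and `spec_to_code`, `code_to_spec`, and `steps` are hypothetical names.

```python
from typing import Callable, Sequence

Program = Callable[[int], int]


def is_self_consistent(spec_to_code: Callable[[str], Program],
                       code_to_spec: Callable[[Program], str],
                       spec: str,
                       inputs: Sequence[int],
                       steps: int = 3) -> bool:
    """Check that repeated spec -> code -> spec round trips preserve behavior."""
    program = spec_to_code(spec)
    expected = [program(x) for x in inputs]
    for _ in range(steps):
        spec = code_to_spec(program)      # the model describes its own code
        program = spec_to_code(spec)      # the model implements its own description
        if [program(x) for x in inputs] != expected:
            return False
    return True
```

Comparing outputs on sample inputs only approximates semantic equivalence, so the choice of inputs matters: two different programs can agree on a poorly chosen set.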
Full Text Available
TRACED: Execution-aware Pre-training for Source Code

https://doi.org/10.1145/3597503.3608140

Ding, Yangruibo; Steenhoek, Benjamin; Pei, Kexin; Kaiser, Gail; Le, Wei; Ray, Baishakhi (February 2024, ACM)

Most existing pre-trained language models for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax tree, dependency graphs, etc.). However, program semantics will not be fully exposed before the real execution. Without an understanding of the program execution, statically pre-trained models fail to comprehensively capture the dynamic code properties, such as the branch coverage and the runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities. To close the gap between the static nature of language models and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code language models with a combination of source code, executable inputs, and corresponding execution traces. Our goal is to teach code models the complicated execution logic during the pre-training, enabling the model to statically estimate the dynamic code properties without repeatedly executing code during task-specific fine-tuning. To illustrate the effectiveness of our proposed approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED relatively improves the statically pre-trained code models by 12.4% for complete execution path prediction and by 25.2% for runtime variable value predictions. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.
Full Text Available
Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles

https://doi.org/10.1109/TSE.2022.3195640

Zhong, Ziyuan; Kaiser, Gail; Ray, Baishakhi (April 2023, IEEE Transactions on Software Engineering)

Full Text Available
